{
 "nbformat": 4,
 "nbformat_minor": 0,
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### Generating names with recurrent neural networks\n",
    "\n",
    "This time you'll find yourself delving into the heart (and other intestines) of recurrent neural networks on a class of toy problems.\n",
    "\n",
    "Struggle to find a name for the variable? Let's see how you'll come up with a name for your son/daughter. Surely no human has expertize over what is a good child name, so let us train RNN instead;\n",
    "\n",
    "It's dangerous to go alone, take these:"
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "# if you're in in colab, uncomment\n",
    "# !wget https://raw.githubusercontent.com/yandexdataschool/Practical_RL/99ae2a3dae648428edbfc41fd10ed688e5365161/week07_%5Brecap%5D_rnn/names -O names"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# Our data\n",
    "The dataset contains ~8k earthling names from different cultures, all in latin transcript.\n",
    "\n",
    "This notebook has been designed so as to allow you to quickly swap names for something similar: deep learning article titles, IKEA furniture, pokemon names, etc."
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import os\n",
    "start_token = \" \"\n",
    "\n",
    "with open(\"names\") as f:\n",
    "    lines = f.read()[:-1].split('\\n')\n",
    "    lines = [start_token + line for line in lines]"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "print ('n samples = ',len(lines))\n",
    "for x in lines[::1000]:\n",
    "    print (x)\n",
    "    \n"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "MAX_LENGTH = max(map(len, lines))\n",
    "print(\"max length =\", MAX_LENGTH)\n",
    "\n",
    "plt.title('Sequence length distribution')\n",
    "plt.hist(list(map(len, lines)),bins=25);"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# Text processing\n",
    "\n",
    "First we need next to collect a \"vocabulary\" of all unique tokens i.e. unique characters. We can then encode inputs as a sequence of character ids."
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "#all unique characters go here\n",
    "tokens = <all unique characters in the dataset>\n",
    "\n",
    "tokens = list(tokens)\n",
    "\n",
    "num_tokens = len(tokens)\n",
    "print ('num_tokens = ', num_tokens)\n",
    "\n",
    "assert 50 < num_tokens < 60, \"Names should contain within 50 and 60 unique tokens depending on encoding\""
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### Convert characters to integers\n",
    "\n",
    "Torch is built for crunching numbers, not strings. \n",
    "To train our neural network, we'll need to replace characters with their indices in tokens list.\n",
    "\n",
    "Let's compose a dictionary that does this mapping."
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "token_to_id = <dictionary of symbol -> its identifier (index in tokens list)>"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "assert len(tokens) == len(token_to_id), \"dictionaries must have same size\"\n",
    "\n",
    "for i in range(num_tokens):\n",
    "    assert token_to_id[tokens[i]] == i, \"token identifier must be it's position in tokens list\"\n",
    "\n",
    "print(\"Seems alright!\")"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "def to_matrix(lines, max_len=None, pad=token_to_id[' '], dtype='int32', batch_first = True):\n",
    "    \"\"\"Casts a list of names into rnn-digestable matrix\"\"\"\n",
    "    \n",
    "    max_len = max_len or max(map(len, lines))\n",
    "    lines_ix = np.zeros([len(lines), max_len], dtype) + pad\n",
    "\n",
    "    for i in range(len(lines)):\n",
    "        line_ix = [token_to_id[c] for c in lines[i]]\n",
    "        lines_ix[i, :len(line_ix)] = line_ix\n",
    "        \n",
    "    if not batch_first: # convert [batch, time] into [time, batch]\n",
    "        lines_ix = np.transpose(lines_ix)\n",
    "\n",
    "    return lines_ix"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "#Example: cast 4 random names to matrices, pad with zeros\n",
    "print('\\n'.join(lines[::2000]))\n",
    "print(to_matrix(lines[::2000]))"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# Recurrent neural network\n",
    "\n",
    "We can rewrite recurrent neural network as a consecutive application of dense layer to input $x_t$ and previous rnn state $h_t$. This is exactly what we're gonna do now.\n",
    "<img src=\"https://github.com/yandexdataschool/Practical_RL/blob/master/week07_%5Brecap%5D_rnn/rnn.png?raw=1\" width=480>\n",
    "\n",
    "Since we're training a language model, there should also be:\n",
    "* An embedding layer that converts character id x_t to a vector.\n",
    "* An output layer that predicts probabilities of next phoneme"
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import torch, torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class CharRNNCell(nn.Module):\n",
    "    \"\"\"\n",
    "    Implement the scheme above as torch module\n",
    "    \"\"\"\n",
    "    def __init__(self, num_tokens=len(tokens), embedding_size=16, rnn_num_units=64):\n",
    "        super(self.__class__,self).__init__()\n",
    "        self.num_units = rnn_num_units\n",
    "        \n",
    "        self.embedding = nn.Embedding(num_tokens, embedding_size)\n",
    "        self.rnn_update = nn.Linear(embedding_size + rnn_num_units, rnn_num_units)\n",
    "        self.rnn_to_logits = nn.Linear(rnn_num_units, num_tokens)\n",
    "        \n",
    "    def forward(self, x, h_prev):\n",
    "        \"\"\"\n",
    "        This method computes h_next(x, h_prev) and log P(x_next | h_next)\n",
    "        We'll call it repeatedly to produce the whole sequence.\n",
    "        \n",
    "        :param x: batch of character ids, int64[batch_size]\n",
    "        :param h_prev: previous rnn hidden states, float32 matrix [batch, rnn_num_units]\n",
    "        \"\"\"\n",
    "        # get vector embedding of x\n",
    "        x_emb = self.embedding(x)\n",
    "        \n",
    "        # compute next hidden state using self.rnn_update\n",
    "        # hint: use torch.cat(..., dim=...) for concatenation\n",
    "        h_next = ###YOUR CODE HERE\n",
    "        \n",
    "        h_next = torch.tanh(h_next)\n",
    "        \n",
    "        assert h_next.size() == h_prev.size()\n",
    "        \n",
    "        #compute logits for next character probs\n",
    "        logits = ###YOUR CODE\n",
    "        \n",
    "        return h_next, F.log_softmax(logits, -1)\n",
    "    \n",
    "    def initial_state(self, batch_size):\n",
    "        \"\"\" return rnn state before it processes first input (aka h0) \"\"\"\n",
    "        return torch.zeros(batch_size, self.num_units)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "char_rnn = CharRNNCell()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### RNN loop\n",
    "\n",
    "Once we've defined a single RNN step, we can apply it in a loop to get predictions on each step."
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "def rnn_loop(char_rnn, batch_ix):\n",
    "    \"\"\"\n",
    "    Computes log P(next_character) for all time-steps in lines_ix\n",
    "    :param lines_ix: an int32 matrix of shape [batch, time], output of to_matrix(lines)\n",
    "    \"\"\"\n",
    "    batch_size, max_length = batch_ix.size()\n",
    "    hid_state = char_rnn.initial_state(batch_size)\n",
    "    logprobs = []\n",
    "\n",
    "    for x_t in batch_ix.transpose(0,1):\n",
    "        hid_state, logp_next = char_rnn(x_t, hid_state)  # <-- here we call your one-step code\n",
    "        logprobs.append(logp_next)\n",
    "        \n",
    "    return torch.stack(logprobs, dim=1)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "batch_ix = to_matrix(lines[:5])\n",
    "batch_ix = torch.tensor(batch_ix, dtype=torch.int64)\n",
    "\n",
    "logp_seq = rnn_loop(char_rnn, batch_ix)\n",
    "\n",
    "assert torch.max(logp_seq).data.numpy() <= 0\n",
    "assert tuple(logp_seq.size()) ==  batch_ix.shape + (num_tokens,)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### Likelihood and gradients\n",
    "\n",
    "We can now train our neural network to minimize crossentropy (maximize log-likelihood) with the actual next tokens.\n",
    "\n",
    "To do so in a vectorized manner, we take `batch_ix[:, 1:]` - a matrix of token ids shifted i step to the left so i-th element is acutally the \"next token\" for i-th prediction"
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "predictions_logp = logp_seq[:, :-1]\n",
    "actual_next_tokens = batch_ix[:, 1:]\n",
    "\n",
    "logp_next = torch.gather(predictions_logp, dim=2, index=actual_next_tokens[:,:,None])\n",
    "\n",
    "loss = -logp_next.mean()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "loss.backward()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "for w in char_rnn.parameters():\n",
    "    assert w.grad is not None and torch.max(torch.abs(w.grad)).data.numpy() != 0, \\\n",
    "        \"Loss is not differentiable w.r.t. a weight with shape %s. Check forward method.\" % (w.size(),)"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### The training loop\n",
    "\n",
    "We train our char-rnn exactly the same way we train any deep learning model: by minibatch sgd.\n",
    "\n",
    "The only difference is that this time we sample strings, not images or sound."
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "from IPython.display import clear_output\n",
    "from random import sample\n",
    "\n",
    "char_rnn = CharRNNCell()\n",
    "opt = torch.optim.Adam(char_rnn.parameters())\n",
    "history = []"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "\n",
    "for i in range(1000):\n",
    "    batch_ix = to_matrix(sample(lines, 32), max_len=MAX_LENGTH)\n",
    "    batch_ix = torch.tensor(batch_ix, dtype=torch.int64)\n",
    "    \n",
    "    logp_seq = rnn_loop(char_rnn, batch_ix)\n",
    "    \n",
    "    # compute loss\n",
    "    <YOUR CODE>\n",
    "    \n",
    "    loss = ###YOUR CODE\n",
    "    \n",
    "    # train with backprop\n",
    "    <YOUR CODE>\n",
    "    \n",
    "    history.append(loss.data.numpy())\n",
    "    if (i+1)%100==0:\n",
    "        clear_output(True)\n",
    "        plt.plot(history,label='loss')\n",
    "        plt.legend()\n",
    "        plt.show()\n",
    "\n",
    "assert np.mean(history[:10]) > np.mean(history[-10:]), \"RNN didn't converge.\""
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### RNN: sampling\n",
    "Once we've trained our network a bit, let's get to actually generating stuff. \n",
    "All we need is the single rnn step function you have defined in `char_rnn.forward`."
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "def generate_sample(char_rnn, seed_phrase=' ', max_length=MAX_LENGTH, temperature=1.0):\n",
    "    '''\n",
    "    The function generates text given a phrase of length at least SEQ_LENGTH.\n",
    "    :param seed_phrase: prefix characters. The RNN is asked to continue the phrase\n",
    "    :param max_length: maximum output length, including seed_phrase\n",
    "    :param temperature: coefficient for sampling.  higher temperature produces more chaotic outputs,\n",
    "                        smaller temperature converges to the single most likely output\n",
    "    '''\n",
    "    \n",
    "    x_sequence = [token_to_id[token] for token in seed_phrase]\n",
    "    x_sequence = torch.tensor([x_sequence], dtype=torch.int64)\n",
    "    hid_state = char_rnn.initial_state(batch_size=1)\n",
    "    \n",
    "    #feed the seed phrase, if any\n",
    "    for i in range(len(seed_phrase) - 1):\n",
    "        hid_state, _ = char_rnn(x_sequence[:, i], hid_state)\n",
    "    \n",
    "    #start generating\n",
    "    for _ in range(max_length - len(seed_phrase)):\n",
    "        hid_state, logp_next = char_rnn(x_sequence[:, -1], hid_state)\n",
    "        p_next = F.softmax(logp_next / temperature, dim=-1).data.numpy()[0]\n",
    "        \n",
    "        # sample next token and push it back into x_sequence\n",
    "        next_ix = np.random.choice(num_tokens,p=p_next)\n",
    "        next_ix = torch.tensor([[next_ix]], dtype=torch.int64)\n",
    "        x_sequence = torch.cat([x_sequence, next_ix], dim=1)\n",
    "        \n",
    "    return ''.join([tokens[ix] for ix in x_sequence.data.numpy()[0]])"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "for _ in range(10):\n",
    "    print(generate_sample(char_rnn))"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "for _ in range(50):\n",
    "    print(generate_sample(char_rnn, seed_phrase=' Trump'))"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### Try it out!\n",
    "You've just implemented a recurrent language model that can be tasked with generating any kind of sequence, so there's plenty of data you can try it on:\n",
    "\n",
    "* Novels/poems/songs of your favorite author\n",
    "* News titles/clickbait titles\n",
    "* Source code of Linux or Tensorflow\n",
    "* Molecules in [smiles](https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system) format\n",
    "* Melody in notes/chords format\n",
    "* Ikea catalog titles\n",
    "* Pokemon names\n",
    "* Cards from Magic, the Gathering / Hearthstone\n",
    "\n",
    "If you're willing to give it a try, here's what you wanna look at:\n",
    "* Current data format is a sequence of lines, so a novel can be formatted as a list of sentences. Alternatively, you can change data preprocessing altogether.\n",
    "* While some datasets are readily available, others can only be scraped from the web. Try `Selenium` or `Scrapy` for that.\n",
    "* Make sure MAX_LENGTH is adjusted for longer datasets. There's also a bonus section about dynamic RNNs at the bottom.\n",
    "* More complex tasks require larger RNN architecture, try more neurons or several layers. It would also require more training iterations.\n",
    "* Long-term dependencies in music, novels or molecules are better handled with LSTM or GRU\n",
    "\n",
    "__Good hunting!__"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### More seriously\n",
    "\n",
    "What we just did is a manual low-level implementation of RNN. While it's cool, i guess you won't like the idea of re-writing it from scratch on every occasion. \n",
    "\n",
    "As you might have guessed, torch has a solution for this. To be more specific, there are two options:\n",
    "* `nn.RNNCell(emb_size, rnn_num_units)` - implements a single step of RNN just like you did. Basically concat-linear-tanh\n",
    "* `nn.RNN(emb_size, rnn_num_units` - implements the whole rnn_loop for you.\n",
    "\n",
    "There's also `nn.LSTMCell` vs `nn.LSTM`, `nn.GRUCell` vs `nn.GRU`, etc. etc.\n",
    "\n",
    "In this example we'll rewrite the char_rnn and rnn_loop using high-level rnn API."
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "class CharRNNLoop(nn.Module):\n",
    "    def __init__(self, num_tokens=num_tokens, emb_size=16, rnn_num_units=64):\n",
    "        super(self.__class__, self).__init__()\n",
    "        self.emb = nn.Embedding(num_tokens, emb_size)\n",
    "        self.rnn = nn.RNN(emb_size, rnn_num_units, batch_first=True)\n",
    "        self.hid_to_logits = nn.Linear(rnn_num_units, num_tokens)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        h_seq, _ = self.rnn(self.emb(x))\n",
    "        next_logits = self.hid_to_logits(h_seq)\n",
    "        next_logp = F.log_softmax(next_logits, dim=-1)\n",
    "        return next_logp\n",
    "    \n",
    "model = CharRNNLoop()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# the model applies over the whole sequence\n",
    "batch_ix = to_matrix(sample(lines, 32), max_len=MAX_LENGTH)\n",
    "batch_ix = torch.tensor(batch_ix, dtype=torch.int64)\n",
    "\n",
    "logp_seq = model(batch_ix)\n",
    "\n",
    "# compute loss. This time we use nll_loss with some duct tape\n",
    "loss = F.nll_loss(logp_seq[:, 1:].contiguous().view(-1, num_tokens), \n",
    "                  batch_ix[:, :-1].contiguous().view(-1))\n",
    "\n",
    "loss.backward()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "Here's another example"
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "import torch, torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class CharLSTMCell(nn.Module):\n",
    "    \"\"\"\n",
    "    Implements something like CharRNNCell, but with LSTM\n",
    "    \"\"\"\n",
    "    def __init__(self, num_tokens=len(tokens), embedding_size=16, rnn_num_units=64):\n",
    "        super(self.__class__,self).__init__()\n",
    "        self.num_units = rnn_num_units\n",
    "        self.emb = nn.Embedding(num_tokens, embedding_size)\n",
    "        self.lstm = nn.LSTMCell(embedding_size, rnn_num_units)\n",
    "        self.rnn_to_logits = nn.Linear(rnn_num_units, num_tokens)\n",
    "        \n",
    "    def forward(self, x, prev_state):\n",
    "        (prev_h, prev_c) = prev_state\n",
    "        (next_h, next_c) = self.lstm(self.emb(x), (prev_h, prev_c))\n",
    "        logits = self.rnn_to_logits(next_h)\n",
    "        \n",
    "        return (next_h, next_c), F.log_softmax(logits, -1)\n",
    "    \n",
    "    def initial_state(self, batch_size):\n",
    "        \"\"\" LSTM has two state variables, cell and hid \"\"\"\n",
    "        return torch.zeros(batch_size, self.num_units), torch.zeros(batch_size, self.num_units)\n",
    "    \n",
    "char_lstm = CharLSTMCell()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# the model applies over the whole sequence\n",
    "batch_ix = to_matrix(sample(lines, 32), max_len=MAX_LENGTH)\n",
    "batch_ix = torh.tensor(batch_ix, dtype=torch.int64)\n",
    "\n",
    "logp_seq = rnn_loop(char_lstm, batch_ix)\n",
    "\n",
    "# compute loss. This time we use nll_loss with some duct tape\n",
    "loss = F.nll_loss(logp_seq[:, 1:].contiguous().view(-1, num_tokens), \n",
    "                  batch_ix[:, :-1].contiguous().view(-1))\n",
    "\n",
    "loss.backward()"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "__Bonus quest: __ implement a model that uses 2 LSTM layers (the second lstm uses the first as input) and train it on your data."
   ]
  }
 ]
}
